-
Notifications
You must be signed in to change notification settings - Fork 161
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
S3 datapipes #165
S3 datapipes #165
Conversation
Can we get this into the torch 1.11.0 release? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you please add ninja cmake
into pip install
on our CI here
Lines 43 to 46 in e53afb7
- name: Install dependencies | |
run: | | |
pip3 install -r requirements.txt | |
pip3 install --pre torch -f https://download.pytorch.org/whl/nightly/cpu/torch_nightly.html |
TorchData is going to do a release after torch 1.11 (March maybe). If this PR is landed before that, we can squeeze it into TorchData release |
Note that, to build torchdata with S3 IO, we'll need pybind11 and "aws-sdk-cpp:s3;transfer" installed. We also need the BUILD_S3 env var set to "ON" to trigger build. |
If so, we need to add To turn on Lines 49 to 50 in e53afb7
|
I've written a README on how to install dependencies and build and use S3 IO datapipes. torchdata/datapipes/iter/load/README.md For aws-sdk-cpp:
How should we install this dependency? |
Add a separate step before building Lines 47 to 48 in e53afb7
|
Added another step to install aws-sdk-cpp. I wonder why install test requirements before building instead of after building and before tests? |
You need to compile dependency before building TorchData, am I right? Except, you want to do a more fancy feature to compile aws and provide S3 IO whenever such DataPipe is used. (I think it's doable but needs to do some research, we should land this PR first then figure it out). |
I meant that I noticed the following step (already existing in ci) is before building torchdata:
Should we move it to after building torchdata and before testing? |
Looks like the CI still doesn't install |
Without SDK installed any usage of S3 pipes fails with
|
Should be nice error instead |
Hi @VitalyFedyunin, I'm helping with this PR while @ydaiming is on vacation, would you recommend we raise a Something like this?
|
Yes. And, you may want to check if
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code looks good, we would need to check final binary size and what happens if we install python egg on the system without AWS library.
@ejguan , let's make the mergins process faster and catch PyTorch 1.11.0 release. The current issues are:
Please help us speed up the process, as we all want to include S3 IO datapipes in the coming torchdata release. Thank you! |
PyTorch is doing final branch cut tomorrow. We can try to test this PR ASAP but I don't think it's possible.
What do you mean different set of commands? Can you list the command you want to use?
You can add
Nope. The Error is pretty clear in the traceback, the Python used to compile
Can I have a code pointer which Error you are expecting? |
@ejguan Could you please trigger the CI again to check if the mypy error is cleared, and the correct python versions are found at building? Thanks. |
The logic for error is:
The tests passed in my environment, I'm not sure what's the different in the CI system. |
For Ubuntu and MacOS, the commands are as following, and already added: git clone --recurse-submodules https://github.com/aws/aws-sdk-cpp
cd aws-sdk-cpp/
mkdir sdk-build
cd sdk-build
cmake .. -DCMAKE_BUILD_TYPE=Release -DBUILD_ONLY="s3;transfer"
make
make install while on Windows, the following commands are used: git clone --recurse-submodules https://github.com/aws/aws-sdk-cpp
mkdir sdk_build
cd sdk_build
cmake ..\aws-sdk-cpp -DCMAKE_BUILD_TYPE=Release -DBUILD_ONLY="s3:transfer"
"C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\MSBuild\Current\Bin\MSBuild.exe" ALL_BUILD.vcxproj -p:Configuration=Release # depending on where MSBuild is
msbuild INSTALL.vcxproj -p:Configuration=Release # with administrative privilege reference: https://docs.aws.amazon.com/sdk-for-cpp/v1/developer-guide/setup-windows.html |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have looked through this PR again. I don't see the problem. @VitalyFedyunin Do you mind taking an extra look at this PR
@ejguan has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
@ejguan has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
Summary: ### Changes - Added S3FileLister and S3FileLoader IterDataPipes. - Added pybind11 build for s3 io cpp files and python scripts. ### TODO - [x] clean up setup files and link pybind11 in CMAKE_PREFIX automatically. - [x] remove `aws-cpp-sdk` dependency at build with `BUILD_S3` env var & pop exceptions when missing dependencies at usage. - [x] new api changes for list_files. - [x] clean up cpp files (naming, new structure, new logic etc.) - [x] expose timeouts, regions. - [x] thorough tests - [x] different correct usage: bucket (with or without / at last), folder (with or without / at last), prefix, item. - [x] different incorrect usage: non-existing files, wrong s3 urls, etc. - [x] region changes - [x] choice of public datasets - [x] benchmarks - [x] performance test - [x] README.md - [x] user guide & recommendations - [x] dependencies Pull Request resolved: pytorch#165 Reviewed By: NivekT Differential Revision: D35095053 Pulled By: ejguan fbshipit-source-id: 36cd3f850cd5a8e79d7d28291ffb2b73e5cea3e0
@ejguan has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
Changes
TODO
aws-cpp-sdk
dependency at build withBUILD_S3
env var & pop exceptions when missing dependencies at usage.